Supplementary Figures 5 and 7 correspond to Section 2. Supplementary Figure 6 is first mentioned in the second results section of the paper, but its content primarily relates to Section 3.
These figures are kept separate because of page size, to keep loading times reasonable.
The formatting of the figures may differ slightly from those in the paper, but they display the same data points.
All code cells are folded by default. To view any cell, click “Code” to expand it, or use the code options near the main title above to unfold all at once.
Some code may be repeated, as the original Python notebook was designed for figures to be generated semi-independently.
cross_val_analysis = pd.merge(
    concat_results_10fold,
    chrY_df,
    left_index=True,
    right_on="filename",
    suffixes=("", "_DROP"),
)
cross_val_analysis.drop(
    columns=[c for c in cross_val_analysis.columns if c.endswith("_DROP")],
    inplace=True,
)
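The merge above uses the `suffixes` argument to tag columns that exist in both tables, then drops the tagged copies. The following is a minimal sketch of that pattern on toy data. The column names and values are hypothetical stand-ins, not the actual contents of `concat_results_10fold` or `chrY_df`.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the two tables being merged.
left = pd.DataFrame(
    {"Max pred": [0.98, 0.71], "sex": ["female", "male"]},
    index=["file_a", "file_b"],
)
right = pd.DataFrame(
    {
        "filename": ["file_a", "file_b"],
        "sex": ["female", "male"],  # column name shared with `left`
        "chrY_zscore": [-0.8, 1.3],
    }
)

merged = pd.merge(
    left,
    right,
    left_index=True,
    right_on="filename",
    suffixes=("", "_DROP"),
)
# For columns present in both inputs, the left copy keeps its name and
# the right copy receives the "_DROP" suffix, so it can be removed here.
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_DROP")])

print(sorted(merged.columns))
```

With an empty suffix for the left table, the values from the left table are the ones that survive, so this pattern is only safe when the duplicated columns are expected to agree.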
Define function zscore_per_assay to compute and graph the metric for each assay instead of globally.
Supp. Fig. 5B: Distribution of the average z-score signal of epigenomes (dots) over chrY per sex (female in red, male in blue), for each assay individually. Only the fold change track type is shown for the ChIP datasets, and the two types each of WGBS and RNA-Seq were merged. Dashed lines represent means, solid lines medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
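The box plot elements named in the caption can be sketched as follows, using only the standard library. This is an illustrative calculation on made-up values; `box_stats` is a hypothetical helper, and the quartile convention used here (linear interpolation, inclusive) is an assumption that may differ from the one applied by the plotting library.

```python
import statistics


def box_stats(values):
    """Return the mean, median, quartiles and whisker ends for one group."""
    ordered = sorted(values)
    # Quartiles by linear interpolation over the observed range (assumed convention).
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the farthest data points that are still inside the fences.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "mean": statistics.fmean(ordered),
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
    }


# Hypothetical chrY z-scores for one assay and one sex, with one outlier.
stats = box_stats([-1.0, -0.5, 0.0, 0.5, 1.0, 6.0])
print(stats)
```

In this toy group the value 6.0 lies beyond the upper fence, so it would be drawn as an isolated point while the upper whisker stops at 1.0. The outlier still pulls the mean above the median, which is why the dashed and solid lines can diverge.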
To see the RNA-Seq and WGBS results, scroll to the right using the horizontal scrollbar at the bottom of the graph.
C - Female/Male chrY signal z-score cluster separation
Define function merged_assays_separation_distance, which computes and graphs the separation distance between male/female z-score clusters.
def merged_assays_separation_distance(
    zscore_df: pd.DataFrame, logdir: Path | None = None, name: str | None = None
) -> None:
    """Complement to figure 2E, showing separation distance (mean, median)
    between male/female zscore clusters, for ChIP-seq (core7). Grouped by EpiRR.

    Args:
        zscore_df (pd.DataFrame): The dataframe with z-score data.
        logdir (Path): The directory path to save the output plots.
        name (str): The base name for the output plot files.
    """
    metric_label = "chrY_zscore_vs_assay_track"

    # Preprocessing
    zscore_df = zscore_df.copy(deep=True)
    zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)
    zscore_df = zscore_df[zscore_df[ASSAY].isin(CORE7_ASSAYS)]  # type: ignore

    # Remove pval/raw tracks
    zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw"])]

    # Average chrY z-score values
    mean_chrY_values_df = zscore_df.groupby(["EpiRR", SEX]).agg(
        {metric_label: "mean", "Max pred": "mean"}
    )
    mean_chrY_values_df.reset_index(inplace=True)
    if not mean_chrY_values_df["EpiRR"].is_unique:
        raise ValueError("EpiRR is not unique.")
    mean_chrY_values_df.reset_index(drop=True, inplace=True)

    distances = {"mean": [], "median": []}
    min_preds = list(np.arange(0, 1.0, 0.01)) + [0.999]
    sample_count = []
    for min_pred in min_preds:
        subset_chrY_values_df = mean_chrY_values_df[
            mean_chrY_values_df["Max pred"] > min_pred
        ]
        sample_count.append(subset_chrY_values_df.shape[0])

        # Compute separation distances
        chrY_vals_female = subset_chrY_values_df[subset_chrY_values_df[SEX] == "female"][
            metric_label
        ]
        chrY_vals_male = subset_chrY_values_df[subset_chrY_values_df[SEX] == "male"][
            metric_label
        ]

        if not chrY_vals_female.empty and not chrY_vals_male.empty:
            mean_distance = np.abs(chrY_vals_female.mean() - chrY_vals_male.mean())
            median_distance = np.abs(chrY_vals_female.median() - chrY_vals_male.median())
            distances["mean"].append(mean_distance)
            distances["median"].append(median_distance)
        else:
            distances["mean"].append(np.nan)
            distances["median"].append(np.nan)

    # Plotting the results
    fig = go.Figure()

    # Add traces for mean and median distances
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=distances["mean"],
            mode="lines+markers",
            name="Mean Distance (left)",
            line=dict(color="blue"),
        )
    )
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=distances["median"],
            mode="lines+markers",
            name="Median Distance (left)",
            line=dict(color="green"),
        )
    )

    # Add trace for number of files
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=np.array(sample_count) / max(sample_count),
            mode="lines+markers",
            name="Proportion of samples (right)",
            line=dict(color="red"),
            yaxis="y2",
        )
    )

    fig.update_xaxes(range=[0.499, 1.0])

    # Update layout for secondary y-axis
    fig.update_layout(
        title="Separation Distance of chrY z-scores male/female clusters - ChIP-Seq",
        xaxis_title="Average Prediction Score minimum threshold",
        yaxis_title="Z-score Distance",
        yaxis2=dict(title="Proportion of samples", overlaying="y", side="right"),
        yaxis2_range=[0, 1.001],
        legend=dict(
            x=1.08,
        ),
    )

    # Save figure
    if logdir:
        if name is None:
            name = "zscore_cluster_separation_distance"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()
Supp. Fig. 5C: Effect of a prediction score threshold on the aggregated mean (blue) and median (green) distances between male and female z-score clusters, as well as on the corresponding file subset size (red), for the ChIP-related assays from panel B.
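To make the quantities plotted in panel C concrete, here is a small standard-library sketch of the threshold sweep without any plotting. The records are invented for illustration and `separation_at` is a hypothetical helper. Only the logic (keep epigenomes whose average prediction score exceeds the threshold, then compare the male and female clusters) follows the function defined for this panel.

```python
import statistics

# Hypothetical per-epigenome records:
# (sex, mean chrY z-score, mean prediction score).
records = [
    ("female", -0.9, 0.99),
    ("female", -0.7, 0.95),
    ("female", 0.4, 0.55),
    ("male", 1.1, 0.98),
    ("male", 0.9, 0.90),
    ("male", -0.2, 0.60),
]


def separation_at(threshold):
    """Return (mean distance, median distance, kept fraction) above a threshold."""
    kept = [r for r in records if r[2] > threshold]
    fraction = len(kept) / len(records)
    female = [z for sex, z, _ in kept if sex == "female"]
    male = [z for sex, z, _ in kept if sex == "male"]
    if not female or not male:
        # A distance is undefined when one cluster is empty.
        return None, None, fraction
    mean_dist = abs(statistics.fmean(female) - statistics.fmean(male))
    median_dist = abs(statistics.median(female) - statistics.median(male))
    return mean_dist, median_dist, fraction


for threshold in (0.5, 0.8):
    print(threshold, separation_at(threshold))
```

In this toy data the two low-confidence epigenomes sit on the wrong side of their cluster, so raising the threshold from 0.5 to 0.8 widens the mean distance while the retained fraction drops. That is the trade-off the blue, green and red curves are meant to show.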
Images extracted from Epilogos viewer, using specified coordinates (XIST and FIRRE positions), and:
View mode: Paired
Dataset: IHEC
Pairwise: Male VS Female 100 samples
Saliency Metric: S1
Supp. Fig. 5E: Epilogos pairwise comparisons of male (top) vs female (bottom) showing portions of important regions for the Sex classifier, including the XIST (left) and FIRRE (right) genes.
See Annex A for a more detailed Epilogos color legend.
F - Genome browser for biospecimen important regions
Supp. Fig. 5F: Genome browser representation of the important regions shown in Figure 2I.
For full code, since the processing is more complex, see src/python/epiclass/utils/notebooks/paper/confidence_threshold.ipynb (permalink).
Supplementary Figure 6: Impact of prediction score threshold on performance metrics. Impact of prediction score threshold on accuracy, F1-score and number of files for both EpiATLAS cross-validation performance and inference on datasets from other databases with provided or extracted labels. Performance for the Assay, Sex, Cancer, Biomaterial type and Life stage classifiers is shown for the EpiATLAS, ENCODE core/non-core, ChIP-Atlas and Recount3 datasets. The number of classes (C) and the number of files analyzed (N) used to calculate the performance are shown at the bottom of each graph. The 11 classes of the Assay classifier for EpiATLAS correspond to the six ChIP-Seq histone modifications, their control Input file, and two protocols each of RNA-Seq and WGBS. For the 9 classes of ENCODE, the two protocols of each were grouped; only the seven ChIP-related assays were used for ChIP-Atlas, and only RNA-Seq for Recount3. For the Sex classifier, the third class corresponds to ‘mixed’, which is absent for ENCODE. The Cancer classifier is binary (where non-cancer is a mix of healthy and other diseases). For the Biomaterial classifier, the ‘primary cell culture’ class is missing from all public sources (but ‘primary cell’, ‘primary tissue’ and ‘cell line’ are present), while the three classes (perinatal, pediatric, adult) were always used for the Life stage classifier.
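As a rough illustration of what each curve in this figure measures, the sketch below computes accuracy, F1-score and file count for the predictions that pass a score threshold. It is not the notebook's code: `metrics_at` is a hypothetical helper, the labels and scores are invented, and both the strict `>` comparison and the macro-averaged F1 are assumptions, since the averaging used for the figure is not stated here.

```python
def metrics_at(y_true, y_pred, scores, threshold):
    """Return (accuracy, macro F1, N) for predictions with score > threshold."""
    kept = [(t, p) for t, p, s in zip(y_true, y_pred, scores) if s > threshold]
    if not kept:
        return None, None, 0
    accuracy = sum(t == p for t, p in kept) / len(kept)
    f1_scores = []
    # Macro average over the labels present in the kept ground truth.
    for label in sorted({t for t, _ in kept}):
        tp = sum(t == label and p == label for t, p in kept)
        fp = sum(t != label and p == label for t, p in kept)
        fn = sum(t == label and p != label for t, p in kept)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores), len(kept)


# Hypothetical labels, predictions and prediction scores for five files.
y_true = ["male", "female", "female", "male", "female"]
y_pred = ["male", "female", "male", "male", "female"]
scores = [0.99, 0.95, 0.55, 0.90, 0.60]

for threshold in (0.0, 0.8):
    print(threshold, metrics_at(y_true, y_pred, scores, threshold))
```

Here the single misclassified file also has the lowest score, so a threshold of 0.8 removes it: accuracy and F1 rise while N falls from 5 to 3.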
Supplementary Figure 7 - Biospecimen classifier - ChromScore for high-SHAP regions
For full code, since the processing is more complex, see src/python/epiclass/utils/notebooks/analyze_hdf5_vals.ipynb (permalink), particularly section “ChromScore hdf5 values”.